Yue Wang1,2, Timothy W Randolph1,3, Ali Shojaie2, Jing Ma4,5. 1. Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA. 2. Department of Biostatistics, University of Washington, Seattle, Washington, USA. 3. Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA. 4. Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA jingma@fredhutch.org. 5. Department of Statistics, Texas A&M University, College Station, Texas, USA.
Abstract
Exploratory analysis of human microbiome data is often based on dimension-reduced graphical displays derived from similarities based on non-Euclidean distances, such as UniFrac or Bray-Curtis. However, a display of this type, often referred to as the principal-coordinate analysis (PCoA) plot, does not reveal which taxa are related to the observed clustering because the configuration of samples is not based on a coordinate system in which both the samples and variables can be represented. The reason is that the PCoA plot is based on the eigen-decomposition of a similarity matrix and not the singular value decomposition (SVD) of the sample-by-abundance matrix. We propose a novel biplot that is based on an extension of the SVD, called the generalized matrix decomposition biplot (GMD-biplot), which involves an arbitrary matrix of similarities and the original matrix of variable measures, such as taxon abundances. As in a traditional biplot, points represent the samples, and arrows represent the variables. The proposed GMD-biplot is illustrated by analyzing multiple real and simulated data sets which demonstrate that the GMD-biplot provides improved clustering capability and a more meaningful relationship between the arrows and points.
IMPORTANCE Biplots that simultaneously display the sample clustering and the important taxa have gained popularity in the exploratory analysis of human microbiome data. Traditional biplots, assuming Euclidean distances between samples, are not appropriate for microbiome data when non-Euclidean distances are used to characterize dissimilarities among microbial communities. Thus, incorporating information from non-Euclidean distances into a biplot becomes useful for graphical displays of microbiome data. The proposed GMD-biplot accounts for any arbitrary non-Euclidean distances and provides a robust and computationally efficient approach for graphical visualization of microbiome data. In addition, the proposed GMD-biplot displays both the samples and taxa with respect to the same coordinate system, which further allows the configuration of future samples.
A biplot simultaneously displays, in two dimensions, rows (samples) of a data matrix as points and columns (variables) as arrows. Based on a matrix decomposition of the data matrix, the biplot is a useful graphical tool for visualizing the structure of large data matrices. It displays a dimension-reduced configuration of samples, as in a principal-coordinate analysis plot, and the variables with respect to the same set of coordinates. If meaningful sample groupings are observed, this allows visualizing which variables contribute most to the separation. The traditional biplot, as first introduced in reference 1, displays the first two left and right singular vectors of the singular value decomposition (SVD) of the data matrix as points and arrows, respectively. This biplot, which we hereafter refer to as the SVD-biplot, uses the SVD to find the optimal least-squares representation of the data matrix in a low-dimensional space. The SVD-biplot can show Euclidean distances between samples and also display approximate variances and correlations of the variables. It also has the appealing property that the singular values obtained from the SVD are nonincreasing, indicating that the decomposition of the total variance of the data matrix into each dimension is nonincreasing.
In many scenarios, the Euclidean distance may not be optimal for characterizing dissimilarities between samples. An important example arises in the analysis of microbiome data, in which marker gene sequences (e.g., 16S rRNA) are often grouped into taxonomic categories using bioinformatic pipelines such as QIIME (2) or mothur (3). These taxon counts can be summarized into a data matrix with rows and columns representing samples and taxon abundances, respectively. A variety of non-Euclidean distance measures, including nonlinear measures, are then used to quantify the similarity between samples. One common measure of dissimilarity is the UniFrac distance (weighted or unweighted), which is a function of the phylogenetic dissimilarity of a pair of samples (4, 5). Other nonphylogenetic, non-Euclidean dissimilarities include Jaccard or Bray-Curtis distances (see, e.g., reference 6 and the references therein). Plotting the samples in the space of the first few principal components (PCs) of the similarity matrix obtained from such non-Euclidean distance matrices, often referred to as principal-coordinate analysis (PCoA), may reveal an informative separation between samples. However, the configuration of samples yielded by PCoA keeps only pairwise distances between samples and lacks a coordinate system that relates to the taxa that constitute each sample. Hence, it does not shed any light on which taxa may play a role in this separation. One approach for addressing this problem is to simply plot an arrow for each taxon based on its correlation with the first two PCs of the non-Euclidean similarity matrix (7). However, in such a “joint plot” (8), the direction and length of an arrow do not represent the taxon’s true contribution to the dissimilarity between samples. In addition, due to the lack of a coordinate system, one cannot add sample points for future observations into this “joint plot.”
Three main approaches have been recently proposed to extend the SVD-biplot to more general distances defined on the samples. The R package “ade4” (9) provides a biplot that can handle weighted Euclidean distances, but it cannot handle non-Euclidean distances. The second approach, proposed by Greenacre (10), aims to approximate the non-Euclidean distance by a weighted Euclidean distance. Weights are estimated for variables, and the biplot can subsequently be constructed using a weighted least-squares approximation of the matrix. This approach has a straightforward interpretation. However, the estimated weighted Euclidean distance may not capture all the information from the original non-Euclidean distance. A recent proposal in reference 11 appears to be the first to address the lack of mathematical duality between the samples’ locations (points) and the variables’ contribution (arrows) to those locations. This approach seeks an approximate SVD-like decomposition of the data matrix, which directly takes the non-Euclidean distance into consideration. This SVD-like decomposition has the following two advantages. First, the left singular vectors are the eigenvectors of the similarity measure derived from the non-Euclidean distance, which preserve the role of the non-Euclidean distance in classifying the samples. Second, an approximate matrix duality (AMD) between the left and right singular vectors is restored, which simply means that each set of vectors can be immediately obtained from the other. To emphasize this connection, we hereafter refer to this decomposition as the AMD. Unfortunately, the AMD also suffers from two drawbacks. First, the AMD is only an approximate decomposition of the data matrix and hence may not capture all the variation of the original data. In particular, the configuration of samples displayed in an AMD-biplot is independent of the data matrix, since the left singular vectors of the AMD depend only on the non-Euclidean distances. Ignoring the data matrix for classifying samples seems nonintuitive since the data matrix is typically assumed to contain some information on the sample similarities. Second, the AMD may result in “singular values” that are not in decreasing order. While these seem like minor technical issues, the second drawback can have important practical implications: which of the left and right singular vectors should be displayed in the resulting biplot? The authors of reference 11 suggest constructing the AMD-biplot based on the two left and right singular vectors that correspond to the two largest singular values. This AMD-biplot assures that the arrows for variables are as meaningful as possible, but it may fail to reveal meaningful sample clusters if the information on sample clusters is associated only with the first several left singular vectors. An alternative approach may be to simply display the first and second left and right singular vectors of the AMD (as done for the SVD). Unfortunately, this strategy does not solve the problem either: although we may observe meaningful sample clusters, the arrows may not be meaningful due to the small singular values. There is thus a lack of clarity regarding which singular vectors should be used to construct the AMD-biplot.
The drawbacks of the AMD-biplot motivate our proposal, which is based on the generalized matrix decomposition biplot (GMD-biplot) (12). The GMD-biplot is a direct generalization of the SVD-biplot that accounts for structural dependencies among the samples and/or variables. This approach has several advantages. First, as with the AMD, it directly handles any non-Euclidean distance matrix. Specifically, the full information from that distance matrix is used. Second, unlike the AMD, which provides an approximate decomposition of the data matrix, the GMD provides an exact decomposition of the original data matrix without losing any information. Third, the GMD restores the matrix duality in a mathematically rigorous manner, unlike the approximate matrix duality obtained with the AMD; it naturally extends the duality inherent in the SVD and allows one to plot both the configuration of samples and the contribution of individual variables with respect to a new coordinate system. Fourth, the GMD gives nonincreasing GMD values, so the resulting GMD-biplot can be directly constructed based on the first two left and right GMD vectors. Last, unlike the AMD-biplot, whose sample clusters depend only on distance, the GMD-biplot uses both the non-Euclidean distance and the data matrix for classifying samples, which more directly connects the contribution of the individual variables to the configuration of samples. Additionally, besides accounting for the non-Euclidean distances between samples, the GMD can also account for auxiliary information on (dis)similarities between the variables.
The remainder of this paper is organized as follows. We first illustrate the GMD-, AMD-, and SVD-biplots in three numerical studies. We then discuss advantages of the proposed GMD-biplot and further extensions. In Materials and Methods, we present a detailed description of the GMD-biplot framework.
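The GMD properties listed above can be checked numerically on toy data. The following is a minimal sketch rather than the authors' implementation: it assumes a positive definite sample kernel `H`, an identity kernel for the variables, and the constraints $U^{\top} H U = I$ and $V^{\top} V = I$. The helper names `sym_power` and `gmd` are hypothetical, and the data are random.

```python
import numpy as np

def sym_power(K, power, tol=1e-10):
    """Matrix power of a symmetric positive semidefinite K via its eigen-decomposition.

    Eigenvalues below tol * (largest eigenvalue) are dropped, so negative
    powers act as pseudo-inverses when K is singular.
    """
    w, Q = np.linalg.eigh(K)
    keep = w > tol * w.max()
    return (Q[:, keep] * w[keep] ** power) @ Q[:, keep].T

def gmd(X, H, q=2):
    """Rank-q GMD of X with respect to a sample kernel H (identity kernel for variables).

    Returns U (n x q), s (q,), V (p x q) with U.T @ H @ U = I and V.T @ V = I.
    """
    # Ordinary SVD of the kernel-weighted data; singular values come out sorted.
    U_w, s, Vt = np.linalg.svd(sym_power(H, 0.5) @ X, full_matrices=False)
    U = sym_power(H, -0.5) @ U_w[:, :q]
    return U, s[:q], Vt[:q].T

# Toy data (not from the paper): 8 samples, 5 variables, positive definite kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
A = rng.normal(size=(8, 8))
H = A @ A.T + 8 * np.eye(8)

U, s, V = gmd(X, H, q=5)
print(np.allclose(U @ np.diag(s) @ V.T, X))     # checks the exact decomposition
print(np.allclose(U.T @ H @ U, np.eye(5)))      # checks H-orthonormal left vectors
print(np.all(s[:-1] >= s[1:]))                  # checks nonincreasing GMD values
print(np.allclose(X @ X.T @ H @ U, U * s**2))   # checks U are eigenvectors of X X^T H
```

With `H` set to the identity matrix, the same function returns the ordinary singular values of `X`, which is the sense in which the GMD generalizes the SVD.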
RESULTS
In the results below, we compare the GMD-, AMD-, and SVD-biplots on three data sets in the manner that each has been proposed recently for microbiome data. In particular, in reference 11, the AMD-biplot is advocated specifically for relative abundance data, while in reference 13, the SVD-biplot is advocated for data that have been scaled by the centered log ratio (CLR) transformation. The GMD-biplot is constructed using the CLR-transformed data. We first examine the performance of all biplots using the smokeless tobacco data set explored in reference 11. In the second study, we compare their performances using the human gut microbiome data from reference 14. In the third analysis, we simulate a data set based on the smokeless tobacco data to illustrate a dilemma that the AMD-biplot may face.
Analysis of the smokeless tobacco data.
This data set includes 15 smokeless tobacco products: 6 dry snuff samples, 7 moist snuff samples, and 2 toombak samples from Sudan. Three separate (replicate) observations (starting with sample preparation) were made of each product, so that a total of 45 observations are available. Each observation has a 271 × 1 vector of taxon counts, and thus, the data set can be formed into a 45 × 271 matrix, denoted by $X$. The squared weighted UniFrac distance, denoted by $\Delta$, was used to measure the distance between samples. The corresponding similarity kernel was calculated as $H = -\frac{1}{2} J \Delta J$, where $J = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ is the centering matrix and $\mathbf{1}$ is an n × 1 vector of ones. Since $H$ is not positive semidefinite, we forced it to be positive semidefinite by removing its negative eigenvalues and the corresponding eigenvectors. The resulting similarity kernel, denoted $\tilde{H}$, has rank 27.
For the GMD-biplot, we consider the CLR transformation of $X$. Specifically, denoting the geometric mean of a vector $x = (x_1, \ldots, x_p)^{\top}$ by $g(x)$, the CLR transformation of $x$ is given by
$$\mathrm{CLR}(x) = \left[\log\frac{x_1}{g(x)}, \ldots, \log\frac{x_p}{g(x)}\right]^{\top}.$$
We denote the resulting data matrix by $X_{\mathrm{CLR}}$. For the AMD-biplot, we converted each row of $X$ into the empirical frequencies and further centered the rows and columns to have mean 0, as done in reference 11. We denote the resulting data matrix by $X_{\mathrm{RA}}$.
We constructed the GMD-biplot and the AMD-biplot based on $\tilde{H}$ using $X_{\mathrm{CLR}}$ and $X_{\mathrm{RA}}$, respectively. Figure 1d displays the proportion of variance captured by each GMD component. It can be seen that the first two GMD components capture more than 80% of the total variance of $X_{\mathrm{CLR}}$, which assures that the resulting GMD-biplot (Fig. 1a) visualizes the data well. As shown in Fig. 1a, the GMD-biplot is perfectly successful at separating the different tobacco products (dry, moist, and toombak). Furthermore, the replicates corresponding to the same product are tightly clustered. By examining the arrows for taxa in Fig. 1a, we see that moist samples may be characterized by elevated levels of Alloiococcus and Halophilus, while Aerococcaceae appears elevated in toombak samples.
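The preprocessing steps described above (double centering of a squared distance matrix, removal of negative eigenvalues, and the CLR transformation) can be sketched as follows. This is an illustrative sketch with hypothetical function names; the `pseudocount` argument is an assumption of the sketch to avoid taking the logarithm of zero counts and is not taken from the text.

```python
import numpy as np

def similarity_kernel(delta_sq):
    """Double-centered kernel -1/2 * J @ delta_sq @ J from squared distances."""
    n = delta_sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return -0.5 * J @ delta_sq @ J

def psd_truncate(H, tol=1e-10):
    """Remove negative (and numerically zero) eigenvalues with their eigenvectors.

    Returns the truncated kernel and its rank.
    """
    w, Q = np.linalg.eigh((H + H.T) / 2)
    keep = w > tol * np.abs(w).max()
    return (Q[:, keep] * w[keep]) @ Q[:, keep].T, int(keep.sum())

def clr(counts, pseudocount=0.5):
    """Row-wise centered log-ratio transform.

    log(x_j / g(x)) equals log(x_j) minus the mean of log(x) over the row.
    The pseudocount is an assumption of this sketch, not part of the source.
    """
    logx = np.log(counts + pseudocount)
    return logx - logx.mean(axis=1, keepdims=True)

# Toy example: a symmetric, zero-diagonal "distance" that is not Euclidean.
rng = np.random.default_rng(0)
D = rng.uniform(size=(6, 6))
delta_sq = (D + D.T) / 2
np.fill_diagonal(delta_sq, 0.0)

H_tilde, rank = psd_truncate(similarity_kernel(delta_sq))
X_clr = clr(rng.poisson(5.0, size=(6, 4)).astype(float))
```

For squared Euclidean distances, `similarity_kernel` returns the Gram matrix of the column-centered points, which is why the same construction underlies PCoA.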
FIG 1
Biplots and scree plots for the analysis of smokeless tobacco data. (a) GMD-biplot based on the first and second components; (b) AMD-biplot based on the second and third components; (c) SVD-biplot based on the first and second components; (d) GMD scree plot; (e) AMD scree plot; (f) SVD scree plot. The biplots in panels a, b, and c display the top taxa with the longest arrows. The sample points are colored by sample type (dry snuff [blue], moist snuff [orange], and toombak [purple]), and samples corresponding to replicates of the same product are plotted with the same symbol (dry snuff [D], moist snuff [M], and toombak [T]). The scree plots in panels d, e, and f display the contributions of the top 10 components. (Panel e is adapted from reference 11.)
Figure 1e, which is the same as the bottom right panel of Fig. 1 in reference 11, shows that the AMD singular values are not necessarily decreasing. It should be noted that Fig. 1b is slightly different from Fig. 3 in reference 11; this difference may be due to the use of the truncated, positive semidefinite kernel $\tilde{H}$ here as opposed to the original kernel $H$ in reference 11. We used the truncated kernel because we wanted the AMD-biplot to be directly comparable to the GMD-biplot, since the GMD requires the kernels for both samples and variables to be positive semidefinite. From Fig. 1b, it can be seen that the AMD-biplot successfully separates toombak samples (purple points) from dry (blue) and moist (orange) snuff samples, although the separation between dry and moist snuff samples in the AMD-biplot is not as definitive as that in the GMD-biplot (Fig. 1a).
Additionally, we included the SVD-biplot and its corresponding scree plot in Fig. 1c and f, respectively. As the SVD-biplot assumes Euclidean distances between samples, it is more appropriate to construct the SVD-biplot using the CLR-transformed data than the relative abundance data (13). It can be seen from Fig. 1c that although the SVD-biplot successfully separates dry snuff from moist and toombak samples, it does not give a clear separation between moist snuff and toombak samples.
It is worth noting that the three biplots identify different top taxa, i.e., the taxa with the longest arrows. Although a biplot is not a rigorous statistical method to detect important taxa, it may shed light on which taxa are important to the observed sample clustering. To see this, we performed a univariate linear regression of each taxon (each column of the data matrix) on the tobacco groups (dry, moist, and toombak) and obtained P values representing the strength of association between each taxon and the tobacco groups. We then sorted these P values in nondecreasing order and obtained the rank of each taxon based on the sorted P values. Hence, it is desirable that the taxa with the lowest ranks can be identified by the biplots. Table S1 in the supplemental material summarizes the ranks of the top 10 taxa identified by each biplot. It can be seen that the top 10 taxa identified by the GMD-biplot have lower ranks on average than those identified by the AMD and SVD biplots, indicating that the GMD-biplot may identify more meaningful taxa with respect to the separation of the samples than the AMD and SVD biplots.
TABLE S1
Ranks of the top 10 taxa identified by the GMD-, AMD-, and SVD-biplots in the analysis of smokeless tobacco data.
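The ranking procedure described above can be sketched on toy data. The sketch assumes that the P value for each taxon comes from the overall F test of the regression on group indicators (the text does not specify the test). Because every taxon then shares the same degrees of freedom, sorting by decreasing F statistic gives the same order as sorting by increasing P value, so no P values need to be computed. The function names and the simulated data are hypothetical.

```python
import numpy as np

def anova_f(y, groups):
    """Overall F statistic for the linear regression of y on group indicators."""
    labels = np.unique(groups)
    grand = y.mean()
    ss_between = sum((groups == g).sum() * (y[groups == g].mean() - grand) ** 2
                     for g in labels)
    ss_within = sum(((y[groups == g] - y[groups == g].mean()) ** 2).sum()
                    for g in labels)
    df_between, df_within = len(labels) - 1, len(y) - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)

def rank_taxa(X, groups):
    """Rank 1 is the taxon most strongly associated with the groups."""
    f = np.array([anova_f(X[:, j], groups) for j in range(X.shape[1])])
    order = np.argsort(-f)                    # largest F (smallest P value) first
    ranks = np.empty(len(f), dtype=int)
    ranks[order] = np.arange(1, len(f) + 1)
    return ranks

# Toy data: 45 observations in three groups, 20 taxa, one truly associated taxon.
rng = np.random.default_rng(1)
groups = np.repeat(["dry", "moist", "toombak"], [18, 21, 6])
X = rng.normal(size=(45, 20))
X[:, 3] += 5.0 * (groups == "toombak")

ranks = rank_taxa(X, groups)
mean_rank_of_selected = ranks[[3, 7, 11]].mean()   # average rank of a hypothetical top-taxa set
```

The average rank of the taxa selected by a biplot, as in the last line, is the kind of summary that can be compared across biplots.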
Analysis of human gut microbiome data.
We consider the human gut microbiome data collected in a study of healthy children and adults from the Amazonas of Venezuela, rural Malawi, and U.S. metropolitan areas (14). The original data set consists of counts for 149 taxa for 100 samples. The squared unweighted UniFrac distance matrix $\Delta$, computed using the R package phyloseq (15), was used to measure the distance between samples. Here, the distance between two samples is based entirely on the number of branches they share on a phylogenetic tree. The distance hence accounts only for the presence/absence of each taxon (not its abundance). The corresponding similarity kernel was then derived as $H = -\frac{1}{2} J \Delta J$, which is a positive semidefinite matrix with rank 99. Let $X_{\mathrm{CLR}}$ and $X_{\mathrm{RA}}$, respectively, denote the CLR-transformed data and the relative abundance data. As in the first study, the GMD-biplot and the AMD-biplot were constructed based on the similarity kernel $H$ using $X_{\mathrm{CLR}}$ and $X_{\mathrm{RA}}$, respectively, and the SVD-biplot was constructed based on the SVD of $X_{\mathrm{CLR}}$.
As concluded in reference 14, shared features of the functional maturation of the gut microbiome are identified during the first 3 years of life. We thus define a binary outcome $h_i \in \{0, 1\}$, for $i = 1, \ldots, 100$, by dichotomizing at 3 years the age of the individual when the $i$th sample was taken. Approximately 70% of the samples are assigned to group 0, and the remaining 30% are assigned to group 1.
In all biplots, the $i$th sample is colored by age and symbolized by $h_i$. Figure 2d indicates that more than 80% of the total variance is explained by the GMD-biplot in Fig. 2a, which provides a good visualization of sample clusters across age. By examining the relationship between the arrows and the color of the sample points in Fig. 2a, we see that Prevotella may be elevated in adults, while Parabacteroides appears to be elevated in infants. In contrast, Fig. 2e shows that less than 15% of the total variance is explained by the AMD-biplot in Fig. 2b and the AMD values are not decreasing. As shown in Fig. 2b, the AMD-biplot also displays potential clusters across age, but the sample points are not as tightly clustered as those in Fig. 2a. Odoribacter appears to be elevated in adults in Fig. 2b, while Lactobacillus appears associated with infants. As a reference, Fig. 2c shows the SVD-biplot of $X_{\mathrm{CLR}}$, which looks very similar to Fig. 2a.
FIG 2
Biplots and scree plots for the analysis of the human gut microbiome data. (a) GMD-biplot based on the first and second GMD components; (b) AMD-biplot based on the first and second components; (c) SVD-biplot based on the first and second SVD components; (d) GMD scree plot; (e) AMD scree plot; (f) SVD scree plot. The biplots in panels a, b, and c display the top three taxa with the longest arrows. Symbols of sample points are based on the ages of individuals when the samples were collected (age ≤ 3 years indicated by circles, and age > 3 years indicated by triangles). The scree plots in panels d, e, and f display the contributions of the top 10 components.
To further quantify the classification accuracy, for each biplot, we predicted the probability that each sample belongs to group 1 based on leave-one-out cross validation using the binary logistic regression of the group index h on the two selected components. We then plotted a receiver operating characteristic (ROC) curve for each biplot based on the predicted probabilities (see Fig. S1 in the supplemental material) and calculated the area under the ROC curve (AUC): the GMD-, AMD-, and SVD-biplots yield AUCs of 0.989, 0.976, and 0.990, respectively. The AUC results indicate that the GMD-biplot provides a better separation of age groups than the AMD-biplot, but there is not a clear difference between the GMD-biplot and the SVD-biplot. This may be because, for the CLR-transformed data $X_{\mathrm{CLR}}$, the unweighted UniFrac distance is not as informative with respect to age as the weighted UniFrac distance was in the tobacco data with respect to product groups.
FIG S1
ROC curves for the GMD-, AMD-, and SVD-biplots in the analysis of human gut microbiome data.
We emphasize that both the GMD-biplot and the SVD-biplot identify Prevotella and Parabacteroides as top taxa, while the AMD-biplot identifies completely different ones. As reference 14 confirms that the trade-off between Prevotella and Bacteroides (including Parabacteroides) considerably drives the variation of microbiome abundance in adults and children between 0.6 and 1 year of age in all studied populations, the GMD- and SVD-biplots may thus identify more biologically meaningful taxa than the AMD-biplot. It should, however, be noted that these bacteria are “identified” based on circumstantial, not statistical, evidence, and more work is needed to examine statistical associations in this context.
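The leave-one-out evaluation described above can be sketched in plain NumPy. This is a minimal sketch on simulated two-component scores, not the authors' code: the small ridge penalty (applied to all coefficients, including the intercept) is an assumption added for numerical stability, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30, 30)))

def fit_logistic(Z, y, ridge=1e-3, n_iter=50):
    """Newton-Raphson fit of a logistic regression with intercept.

    The ridge term is an assumption of this sketch; it keeps the fit finite
    when the classes are perfectly separated.
    """
    A = np.column_stack([np.ones(len(y)), Z])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = sigmoid(A @ beta)
        grad = A.T @ (y - p) - ridge * beta
        hess = (A * (p * (1 - p))[:, None]).T @ A + ridge * np.eye(A.shape[1])
        beta += np.linalg.solve(hess, grad)
    return beta

def loocv_probabilities(Z, y):
    """Predicted probability of group 1 for each sample, fitted without that sample."""
    n = len(y)
    probs = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i
        beta = fit_logistic(Z[train], y[train])
        probs[i] = sigmoid(beta[0] + Z[i] @ beta[1:])
    return probs

def auc(y, scores):
    """Probability that a random group-1 sample outscores a random group-0 sample."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Toy stand-in for the two plotted components: 100 samples, 30% in group 1.
rng = np.random.default_rng(0)
h = np.repeat([0.0, 1.0], [70, 30])
Z = rng.normal(size=(100, 2))
Z[:, 0] += 2.0 * h

cv_auc = auc(h, loocv_probabilities(Z, h))
```

In an application, `Z` would hold the two components displayed in a given biplot, so that the AUC summarizes how well those two coordinates alone separate the groups.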
Incorporating a kernel for variables into the GMD-biplot.
The GMD problem (see equation 3 in Materials and Methods) allows not only the similarity kernel for samples but also a kernel for the variables. Including both kernels may further improve the accuracy of sample classification as well as the identification of important variables. We illustrate this advantage by incorporating a kernel for variables in the analysis of the human gut microbiome data. More specifically, we first calculate a matrix $\Delta_R$ of squared patristic distances between the tips of the phylogenetic tree for each pair of taxa and then derive a similarity matrix as $R = -\frac{1}{2} J \Delta_R J$, where $J$ is the centering matrix of matching dimension. Figure 3a shows the GMD-biplot with the additional kernel incorporated. The ROC analysis based on the leave-one-out cross validation for Fig. 3a yields an AUC of 0.984, which is higher than that of the AMD-biplot (Fig. 2b) but slightly lower than those of Fig. 2a and Fig. 2c. This may be because both the sample kernel $H$ and the variable kernel $R$ highly depend on the phylogenetic tree. Thus, incorporating $R$ may be redundant and may reduce the accuracy of the sample clustering in this case. The top three taxa identified in Fig. 3a include Prevotella but not Parabacteroides, which may explain the lower clustering accuracy.
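A decomposition with kernels on both samples and variables can be sketched as below. The sketch assumes the constraints $U^{\top} H U = I$ and $V^{\top} R V = I$ and a weighted least-squares objective $\mathrm{tr}(H E R E^{\top})$ for the residual $E$; equation 3 itself is not reproduced in this section, so these are stated assumptions. Both toy kernels are positive definite and random, and the helper names are hypothetical.

```python
import numpy as np

def sym_power(K, power, tol=1e-10):
    """Power of a symmetric PSD matrix, restricted to its nonzero eigenvalues."""
    w, Q = np.linalg.eigh(K)
    keep = w > tol * w.max()
    return (Q[:, keep] * w[keep] ** power) @ Q[:, keep].T

def gmd_two_kernels(X, H, R, q=2):
    """Rank-q GMD of X with respect to a sample kernel H and a variable kernel R."""
    X_w = sym_power(H, 0.5) @ X @ sym_power(R, 0.5)      # doubly weighted data
    U_w, s, Vt = np.linalg.svd(X_w, full_matrices=False)
    U = sym_power(H, -0.5) @ U_w[:, :q]
    V = sym_power(R, -0.5) @ Vt[:q].T
    return U, s[:q], V

def random_pd(rng, m):
    """Random positive definite matrix used as a toy kernel."""
    A = rng.normal(size=(m, m))
    return A @ A.T + m * np.eye(m)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
H, R = random_pd(rng, 8), random_pd(rng, 5)

U, s, V = gmd_two_kernels(X, H, R, q=5)      # full decomposition
E = X - (U[:, :2] * s[:2]) @ V[:, :2].T      # residual after keeping two components
weighted_error = np.trace(H @ E @ R @ E.T)   # equals the sum of the discarded squared GMD values
```

Under these assumptions, the weighted error of the two-component approximation equals the sum of the squared GMD values that were left out, mirroring the familiar property of the truncated SVD.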
FIG 3
The biplot and scree plot for the analysis of the human gut microbiome data using both the sample kernel $H$ and the variable kernel $R$. (a) GMD-biplot using both $H$ and $R$ based on the first and second GMD components. The top three taxa with the longest arrows are displayed. Symbols of sample points are based on the ages of the individuals when the samples were collected (age ≤ 3 years indicated by circles, and age > 3 years indicated by triangles). (b) GMD scree plot using both $H$ and $R$. The contributions of the top 10 components are displayed.
Including an additional kernel for variables in the GMD-biplot is related to the method of double-principal-coordinate analysis (DPCoA) (16). DPCoA, as shown in reference 17, is equivalent to a generalized PCoA which essentially incorporates an additional similarity kernel for variables into the analysis, as described in Proposition 1, but for $H = I$. As suggested in reference 18, DPCoA can provide a biplot representation of both samples and meaningful taxonomic categories. Hence, the GMD-biplot can also be viewed as an extension of DPCoA biplots because the GMD allows kernels for both samples and variables, while DPCoA allows a kernel only for variables.
Simulation.
In this section, we conduct a simulation study based on the smokeless tobacco data to illustrate a scenario in which the AMD-biplot may fail to separate the samples, whereas the GMD-biplot performs well. Let $H$ and $X$ be the (rank-27) similarity kernel and the data matrix from the smokeless tobacco data, respectively. We consider the eigen-decomposition of $H$ as $H = U \Lambda U^{\top}$, where $U$ is a 45 × 27 matrix whose columns are eigenvectors of $H$ and $\Lambda$ is a diagonal matrix whose elements are the eigenvalues of $H$. Then, the AMD-biplot is based on the following approximate orthogonal decomposition of $X$:
$$X \approx U D V^{\top},$$
where $D$ is a diagonal matrix of AMD “singular values” and $V$ is a 271 × 27 matrix with orthonormal columns. As shown in Fig. 1e, the diagonal elements of $D$ may not be decreasing. We define a modified diagonal matrix $D_S$, whose first three diagonal elements are listed in Table 1, and construct the simulated data set as $X_S = U D_S V^{\top}$. For $i = 1, \ldots, 45$, we define a binary outcome $w_i$ that indicates the group index of the $i$th sample and depends only on the first column of $U$.
The GMD-biplot and the AMD-biplot of $X_S$ with similarity measure $H$ are presented in Fig. 4a and b, respectively. It can be seen that the two groups are completely mixed up in the AMD-biplot because the first column of $U$ is not selected for visualization. In contrast, the GMD-biplot successfully visualizes the sample groups by displaying the first and second GMD components.
FIG 4
Biplots for the analysis of the simulated data. (a) GMD-biplot based on the first and second GMD components; (b) AMD-biplot based on the second and third components. Both biplots display the top six taxa with the longest arrows. The sample points are colored by the group index (1 [red] or 0 [black]).
To see why this occurs, we summarize the first three diagonal elements of $D_S$, $\Lambda$, and $D_S^2\Lambda$ in Table 1 and notice that the second and third diagonal elements of $D_S$ (0.8 and 1) exceed the first (0.6). Consequently, the AMD-biplot displays the second and third columns of $U$ and hence, it completely fails to classify the samples because the group index w depends only on the first column of $U$. In contrast, Proposition 1a (see Materials and Methods) shows that the GMD-biplot is based on the two largest eigenvalues and the corresponding eigenvectors of $X_S X_S^{\top} H$. It can be further seen that
$$X_S X_S^{\top} H U = U D_S^2 \Lambda. \qquad (2)$$
TABLE 1
The first three diagonal elements of $D_S$, $\Lambda$, and $D_S^2\Lambda$ in the simulation
| Matrix | 1st | 2nd | 3rd |
|---|---|---|---|
| $D_S$ | 0.6 | 0.8 | 1 |
| $\Lambda$ | 3.09 | 1.26 | 0.77 |
| $D_S^2\Lambda$ | 1.11 | 0.81 | 0.77 |
Equation 2 implies that the diagonal elements of $D_S^2\Lambda$ are the eigenvalues of $X_S X_S^{\top} H$ and the columns of $U$ are the corresponding eigenvectors. Hence, it can be seen from Table 1 that the first two diagonal elements of $D_S^2\Lambda$ (1.11 and 0.81) are larger than the third (0.77), even though the first two diagonal elements of $D_S$ (0.6 and 0.8) are smaller than the third (1). Therefore, the GMD-biplot displays the first and second columns of $U$ as sample points, which successfully captures the sample classification.
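The arithmetic behind Table 1 and the eigen-relation used above can be reproduced in a few lines. The first part uses only the values reported in Table 1. The second part is a toy check with random orthonormal matrices standing in for $U$ and $V$ (an assumption of this sketch, restricted to three components), showing that $X_S X_S^{\top} H U = U D_S^2 \Lambda$ when $X_S = U D_S V^{\top}$ and $H = U \Lambda U^{\top}$.

```python
import numpy as np

d_s = np.array([0.6, 0.8, 1.0])       # first three diagonal elements of D_S (Table 1)
lam = np.array([3.09, 1.26, 0.77])    # first three diagonal elements of Lambda (Table 1)

gmd_eig = d_s**2 * lam
print(np.round(gmd_eig, 2))                      # [1.11 0.81 0.77]
print(np.sort(np.argsort(-d_s)[:2]) + 1)         # [2 3]: columns picked by the largest D_S values
print(np.sort(np.argsort(-gmd_eig)[:2]) + 1)     # [1 2]: columns picked by the largest D_S^2 Lambda values

# Toy check of the eigen-relation with random orthonormal U (10 x 3) and V (7 x 3).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(10, 3)))
V, _ = np.linalg.qr(rng.normal(size=(7, 3)))
X_s = U @ np.diag(d_s) @ V.T
H = U @ np.diag(lam) @ U.T
relation_holds = np.allclose(X_s @ X_s.T @ H @ U, U @ np.diag(gmd_eig))
```

Among the three tabulated components, ordering by $D_S$ alone selects the second and third columns of $U$, whereas ordering by $D_S^2\Lambda$ selects the first and second, which is the contrast between the two biplots in this simulation.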
DISCUSSION
Biplots have gained popularity in the exploratory analysis of high-dimensional microbiome data. The traditional SVD-biplot is based on Euclidean distances between samples and cannot be directly applied when more general dissimilarities are used. Since Euclidean distances may not lead to an optimal low-dimensional representation of the samples, we have extended the concept of the SVD-biplot to allow for more general similarity kernels. The phylogenetically informed UniFrac distance, used in our examples, defines one such kernel. In settings where a general (possibly nonlinear) distance matrix is appropriate, our approach provides a mathematically rigorous and computationally efficient method, based on the GMD, that allows for plotting both the samples and variables with respect to the same coordinate system.Our first data example with the smokeless tobacco data set from reference 11 demonstrates the merits of the proposed GMD-biplot. We found that the GMD-biplot successfully displays different types of products, while the AMD-biplot is not able to completely separate dry and moist snuff samples, and the SVD-biplot fails to separate moist and toombak samples. As shown in Table S1 in the supplemental material, the GMD-biplot is also able to identify biologically more meaningful taxa that are related to the different types of products, compared to the AMD-biplot and the SVD-biplot.In our second example, the GMD-biplot also outperforms the AMD-biplot in terms of both the sample clustering and the identification of important taxa. However, there is not a clear advantage of the GMD-biplot over the SVD-biplot in this example. This difference between the two examples may be attributed to the relation between the Euclidean kernel and the non-Euclidean similarity measure. 
Denoting the Euclidean kernel and the non-Euclidean similarity measure by $XX^\top$ and $H$, respectively, it can be seen that the sample configuration in the AMD-biplot and the SVD-biplot depends solely on either $H$ or $XX^\top$, whereas the GMD-biplot uses the top two eigenvectors of the matrix product of the Euclidean kernel and $H$. Hence, if $XX^\top$ contains substantially more information about sample clustering than $H$, then taking $H$ into consideration may not further improve the accuracy of sample clustering. Indeed, this may be the case in our second example, where the clustering of samples using the Euclidean distance between samples of the CLR-transformed data is highly successful because the difference of the microbial profiles between infants and adults is obvious even without the help of the UniFrac distance. However, a possibly more common scenario is when both $XX^\top$ and $H$ contain some, but different, information on sample clustering. In such cases, taking both $XX^\top$ and $H$ into consideration may improve the sample clustering and provide better biological interpretation.

In practice, we typically do not know what the true configuration of samples looks like, so it is impossible to determine whether $XX^\top$ or $H$ contains more information about sample clusters. Also, it is sensible to assume that $XX^\top$ and $H$ are “coinformative” in the sense that they exhibit a shared eigenstructure; for instance, both may be informative for clustering samples. The coinformativeness can be quantified precisely using the Hilbert-Schmidt independence criterion (HSIC) (19). For any two kernels $K_1$ and $K_2$, the empirical HSIC is proportional to $\operatorname{tr}(K_1 K_2)$. Hence, by definition, the GMD problem in equation 3 (see Materials and Methods) is equivalent to minimizing the HSIC between $H$ and $(X - USV^\top)(X - USV^\top)^\top$ over $U$, $S$, and $V$. In other words, if we consider $X - USV^\top$ as the residual matrix of $X$, then the GMD solutions can be interpreted as the best approximation to $X$ in the sense that the HSIC between $H$ and the Euclidean kernel of the residual matrix is minimized.
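The link between the GMD objective and the HSIC can be written out in one line. The following is a sketch under two assumptions that are made here for illustration: the variable-side matrix is the identity, $R = I_p$, and the weighted norm is $\|M\|_{H,R}^2 = \operatorname{tr}(HMRM^\top)$. The symbol $E$ for the residual matrix is introduced only for this display.

```latex
\|X - USV^\top\|_{H, I_p}^2
  = \operatorname{tr}\!\left(H\,E\,E^\top\right),
  \qquad E = X - USV^\top .
```

Since the empirical HSIC of two kernels is proportional to the trace of their product, the right-hand side is, up to a constant, the HSIC between $H$ and the Euclidean kernel $EE^\top$ of the residual.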
Thus, the GMD-biplot considers the coinformativeness of $XX^\top$ and $H$. Therefore, in many cases, it would be a more robust way to display the sample points compared to the AMD-biplot or the SVD-biplot. Another advantage of the GMD-biplot over the AMD-biplot is illustrated in our simulation study. Since the AMD may not give decreasing singular values, the AMD-biplot may not be able to display the most informative eigenvectors of $H$, and may thus fail to cluster the samples. In contrast, the GMD ensures that the resulting singular values are nonincreasing.

Our discussion in this paper has focused on the form biplot, which aims to visualize the relationship between variables and the sample clustering. In other scenarios, where the variation of the data matrix explained by each variable is of particular interest, the covariance biplot may be more appropriate. This biplot considers the GMD of $X$ with respect to $H$; i.e., , where and . Note that where . Furthermore, since $V$ has orthogonal columns, it can be seen that . Thus, the value of gives the proportion of the variability in $X$ explained by the jth variable. Note that when q = 2, the length of the arrow of the jth variable in the covariance biplot is given by . Therefore, in a covariance biplot, the arrows shed light on how the total variance of the data is partitioned into parts explained by each variable.
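One way to see how the total variance splits across variables is the short derivation below. It is a sketch, not a restatement of the original displays, and it rests on assumptions made here: a full GMD $X = USV^\top$ with $U^\top H U = I_K$ and $V^\top V = I_K$ (that is, $R = I_p$), GMD values $\sigma_1, \ldots, \sigma_K$ on the diagonal of $S$, and $v_{jk}$ denoting the $(j, k)$ entry of $V$.

```latex
\begin{aligned}
\|X\|_{H, I_p}^2
  &= \operatorname{tr}\!\left(X^\top H X\right)
   = \operatorname{tr}\!\left(V S\, U^\top H U\, S V^\top\right)
   = \operatorname{tr}\!\left(V S^2 V^\top\right) \\
  &= \sum_{j=1}^{p} \sum_{k=1}^{K} \sigma_k^2 v_{jk}^2
   = \sum_{k=1}^{K} \sigma_k^2 .
\end{aligned}
```

The last equality uses $\sum_{j} v_{jk}^2 = 1$ for every $k$. Under these assumptions, the jth variable accounts for the share $\sum_{k} \sigma_k^2 v_{jk}^2 / \sum_{k} \sigma_k^2$ of the total. If the arrows are taken to be the rows of $VS$ restricted to the first two columns (the usual covariance-biplot convention, assumed here), the arrow for the jth variable has length $\sqrt{\sigma_1^2 v_{j1}^2 + \sigma_2^2 v_{j2}^2}$.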
MATERIALS AND METHODS
We denote the data matrix by $X \in \mathbb{R}^{n \times p}$, where n is the number of samples and p is the number of variables (taxa). We assume that the columns of $X$ are centered to have mean 0 and $\operatorname{rank}(X) = K$. For any matrix $M$, we denote its ith row (sample) and its (i, j) entry by $m_i$ and $m_{ij}$, respectively. We denote the transpose of $M$ by $M^\top$.
Biplot, distance measure, and AMD.
A biplot is a graphical method to simultaneously represent, in two dimensions, both the rows (as points) and columns (as arrows) of the matrix $X$ on the same coordinate axes. Given a decomposition of $X$ as $X = AB^\top$, a biplot displays two selected columns of $A$ and $B$. The SVD-biplot is based on the singular value decomposition (SVD) of $X$, i.e., $X = USV^\top$, where $U^\top U = V^\top V = I_K$ and $S = \operatorname{diag}(\sigma_1, \ldots, \sigma_K)$ with $\sigma_1 \ge \cdots \ge \sigma_K > 0$ being a sequence of nonincreasing and positive singular values. Here $I_K$ is the $K \times K$ identity matrix. Based on the SVD, $A$ and $B$ can take various forms; examples include form and covariance biplots (7). Since our primary interest is to visualize the clustering of samples, we focus on the form biplot in this paper and comment on the covariance biplot in the Discussion.

The SVD-biplot displays the first two columns of $US$ and $V$, which can explain $(\sigma_1^2 + \sigma_2^2) / \sum_{k=1}^{K} \sigma_k^2$ of the total variance of $X$. The SVD of $X$ is closely related to the eigen-decomposition of the similarity kernel $XX^\top$, as we can write $XX^\top = US^2U^\top$. Thus, the eigen-decomposition of $XX^\top$ provides a way to calculate $U$ and $S$. Once $U$ and $S$ are calculated, one can calculate $V$ from the duality between $U$ and $V$; that is, $V = X^\top U S^{-1}$. The similarity kernel $XX^\top$ characterizes the Euclidean distance between samples. To see this, we define the Euclidean squared distance between the ith and jth sample as $d_{ij} = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2$. Let $J = I_n - n^{-1}\mathbf{1}\mathbf{1}^\top$ be the centering matrix, where $\mathbf{1}$ is an n × 1 vector of ones. It can then be seen that $XX^\top = -\tfrac{1}{2} J \Delta J$, where the (i, j) entry of $\Delta$ is $d_{ij}$.

Now, if $\Delta$ is replaced by a matrix of non-Euclidean squared dissimilarities, one can still define a similarity kernel by $H = -\tfrac{1}{2} J \Delta J$. One such example is when $\Delta$ arises from distances between sample vectors of microbial abundances (or presence/absence) which account for a phylogenetic tree, as in a weighted (respectively, unweighted) UniFrac distance matrix. In this case, one can construct a principal-coordinate analysis (PCoA) plot of the samples using $H$. However, an SVD-biplot cannot be constructed, since there is no $V$ that corresponds to the variables.
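The relation between the Euclidean kernel and the double-centered matrix of squared distances can be checked numerically. The sketch below uses plain Python lists to stay dependency-free; the function names and the small abundance table are hypothetical and chosen only for illustration.

```python
def center_columns(X):
    """Subtract each column mean so that every column has mean 0."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def euclidean_kernel(X):
    """Return XX^T, the inner products between samples (rows of X)."""
    return [[sum(a * b for a, b in zip(r, s)) for s in X] for r in X]


def squared_distances(X):
    """Return Delta, whose (i, j) entry is the squared Euclidean distance d_ij."""
    return [[sum((a - b) ** 2 for a, b in zip(r, s)) for s in X] for r in X]


def kernel_from_distances(delta):
    """Return -1/2 * J * Delta * J, computed by double-centering Delta."""
    n = len(delta)
    row_mean = [sum(row) / n for row in delta]
    col_mean = [sum(col) / n for col in zip(*delta)]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (delta[i][j] - row_mean[i] - col_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


# Hypothetical table: 4 samples x 3 taxa (illustrative numbers only).
raw = [[5.0, 1.0, 3.0], [2.0, 2.0, 2.0], [3.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
X = center_columns(raw)

kernel_direct = euclidean_kernel(X)
kernel_gower = kernel_from_distances(squared_distances(X))

# Largest entrywise difference between the two kernels.
max_gap = max(abs(a - b)
              for r, s in zip(kernel_direct, kernel_gower)
              for a, b in zip(r, s))
print(max_gap)
```

Passing a matrix of non-Euclidean squared dissimilarities (for example, squared UniFrac distances) to `kernel_from_distances` would give the corresponding similarity kernel in the same way, although that kernel no longer equals an inner-product matrix of the data.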
The approximate matrix duality (AMD) addresses this problem by fixing and then seeking a matrix with orthonormal columns and a diagonal matrix with nonnegative elements that minimize the objective function

Here, is the Frobenius norm of , and for any square matrix . The resulting AMD-biplot can be constructed by plotting the two columns of as sample points and plotting as arrows for variables; the selected two columns/rows correspond to the two largest elements of .
GMD and the GMD-biplot.
The concept of generalized matrix decomposition (GMD) was introduced by Escoufier (20) and further developed in reference 12. It is a generalization of the SVD with additional structural dependencies taken into consideration. We briefly review the key ideas behind the GMD. Let $H \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{p \times p}$ be two positive semidefinite matrices, which characterize the similarities between samples and between variables, respectively. The $(H, R)$-norm of $X$ is defined as $\|X\|_{H,R} = \sqrt{\operatorname{tr}(HXRX^\top)}$. For any $q \le K$, the GMD solution finds the best rank-q approximation to $X$ with respect to the $(H, R)$-norm, that is,

$$\min_{U, S, V} \|X - USV^\top\|_{H,R}^2 \qquad (3)$$

subject to $U^\top H U = I_q$ and $V^\top R V = I_q$. Here, $U$ and $V$ are the left and right GMD vectors, respectively, and $S$ is a diagonal matrix containing the GMD values. Note that $U$ and $V$ are orthogonal with respect to $H$ and $R$, respectively, but they may not be orthogonal with respect to the Euclidean norm unless $H = I_n$ and $R = I_p$. The following property of the GMD provides a way to calculate the GMD components; the proof can be found in reference 20.

Proposition 1: The GMD solutions satisfy:
(a) $XRX^\top H U = US^2$;
(b) $V = X^\top H U S^{-1}$.
Proposition 1a suggests that the diagonal elements of $S^2$ and corresponding columns of $U$ are eigenvalues and corresponding eigenvectors of $XRX^\top H$, respectively. Proposition 1b establishes the duality between $U$ and $V$, meaning that $V$ can be immediately obtained, given $U$ and $S$. Alternatively, an efficient algorithm for finding the solution to equation 3 was proposed in reference 12, which is less computationally intensive compared to finding the eigenvalues and eigenvectors of $XRX^\top H$. The algorithm also ensures that the diagonal elements of $S$ are nonincreasing.

Note that the GMD can handle the non-Euclidean similarity kernel $H$ just by taking $R = I_p$. Based on the GMD of $X$ with respect to $H$, the GMD-biplot can be constructed with respect to the coordinate system provided by the first two columns of $V$. More specifically, letting $v_j$ be the jth column of $V$, the ith sample point can be configured by the coordinates of $x_i$, given by $(x_i^\top R v_1, x_i^\top R v_2)$. To plot the arrow for the jth variable, we consider the vector $e_j$, which has a 1 in the jth element and 0’s elsewhere. Then, the arrow for the jth variable can be configured by the coordinates of $e_j$, given by $(e_j^\top R v_1, e_j^\top R v_2)$. This coordinate system also allows the configuration of future samples. Letting $x^*$ be a future sample, we can add $x^*$ into the GMD-biplot as a point located at $(x^{*\top} R v_1, x^{*\top} R v_2)$. Similar to the SVD-biplot, the GMD-biplot can explain $(\sigma_1^2 + \sigma_2^2) / \|X\|_{H,R}^2$ of the total variance of $X$ with respect to the $(H, R)$-norm, where $\sigma_k$ is the kth diagonal element of $S$ for k = 1, 2.

Since the GMD values are nonincreasing, for the purpose of constructing the GMD-biplot, we can choose q = 2 in the GMD problem (equation 3), which may save considerable computational time. In contrast, since the AMD may produce “singular values” that are not in decreasing order, we have to find the full decomposition of $X$ by the AMD before deciding which singular vectors to plot in the AMD-biplot; this may become computationally intensive for large n and p.
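The construction above can be sketched in a few lines of code. The example below computes a rank-2 GMD by alternating power updates with deflation, in the spirit of the iterative algorithm mentioned above, and then reads off the form-biplot coordinates. It is an illustrative sketch, not the GMDecomp implementation: the function name `gmd`, the fixed iteration count, the starting vector, and the toy matrices `X`, `H`, and `R` are all assumptions made here, and plain Python lists are used to avoid external dependencies.

```python
import math


def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(x, y):
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def gmd(X, H, R, q=2, iters=200):
    """Rank-q GMD sketch via alternating power updates and deflation.

    Returns (U, s, V): U[k] and V[k] are the k-th left and right GMD
    vectors, and s[k] is the k-th GMD value.
    """
    n, p = len(X), len(X[0])
    Xk = [row[:] for row in X]                 # working copy, deflated per component
    U, s, V = [], [], []
    for _ in range(q):
        Xt = [list(col) for col in zip(*Xk)]   # transpose of the working copy
        v = [1.0 / (j + 1) for j in range(p)]  # arbitrary nonzero start
        for _ in range(iters):
            u = matvec(Xk, matvec(R, v))
            u_norm = math.sqrt(dot(u, matvec(H, u)))   # H-norm of u
            u = [a / u_norm for a in u]
            v = matvec(Xt, matvec(H, u))
            v_norm = math.sqrt(dot(v, matvec(R, v)))   # R-norm of v
            v = [a / v_norm for a in v]
        d = dot(u, matvec(H, matvec(Xk, matvec(R, v))))  # GMD value u^T H X R v
        # Deflate: remove the fitted rank-one component.
        Xk = [[Xk[i][j] - d * u[i] * v[j] for j in range(p)] for i in range(n)]
        U.append(u)
        s.append(d)
        V.append(v)
    return U, s, V


# Toy inputs (hypothetical): column-centered X, positive definite H, R = identity.
X = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.0], [0.0, -2.0, 1.0], [-1.0, 1.0, -2.0]]
H = [[2.0, 1.0, 0.0, 0.0], [1.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 1.0], [0.0, 0.0, 1.0, 2.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

U, s, V = gmd(X, H, R)

# Form-biplot coordinates: sample i at (x_i^T R v_1, x_i^T R v_2),
# variable j at (e_j^T R v_1, e_j^T R v_2).
Rv = [matvec(R, V[k]) for k in range(2)]
points = [[dot(x_i, Rv[k]) for k in range(2)] for x_i in X]
arrows = [[Rv[k][j] for k in range(2)] for j in range(len(X[0]))]
```

A future sample would be placed with the same formula as `points`, using its own abundance vector in place of a row of `X`. Only two components are extracted, which reflects the point made above that q = 2 suffices for the biplot when the GMD values are nonincreasing.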
Data availability.
All data used are publicly available in references 11 and 14. All computations are conducted in the R programming language, and the proposed biplot is implemented in our R package “GMDecomp,” available at https://github.com/taryue/GMDecomp.