Kumar P Mainali, Sharon Bewick, Peter Thielen, Thomas Mehoke, Florian P Breitwieser, Shishir Paudel, Arjun Adhikari, Joshua Wolfe, Eric V Slud, David Karig, William F Fagan.
Abstract
Drawing on a long history in macroecology, correlation analysis of microbiome datasets is becoming a common practice for identifying relationships or shared ecological niches among bacterial taxa. However, many of the statistical issues that plague such analyses in macroscale communities remain unresolved for microbial communities. Here, we discuss problems in the analysis of microbial species correlations based on presence-absence data. We focus on presence-absence data because this information is more readily obtainable from sequencing studies, especially for whole-genome sequencing, where abundance estimation is still in its infancy. First, we show how Pearson's correlation coefficient (r) and Jaccard's index (J), two of the most common metrics for correlation analysis of presence-absence data, can contradict each other when applied to a typical microbiome dataset. In our dataset, for example, 14% of species-pairs predicted to be significantly correlated by r were not predicted to be significantly correlated using J, while 37.4% of species-pairs predicted to be significantly correlated by J were not predicted to be significantly correlated using r. Mismatch was particularly common among species-pairs with at least one rare species (<10% prevalence), explaining why r and J might differ more strongly in microbiome datasets, where there are large numbers of rare taxa. Indeed, 74% of all species-pairs in our study had at least one rare species. Next, we show how Pearson's correlation coefficient can result in artificial inflation of positive taxon relationships and how this is a particular problem for microbiome studies. We then illustrate how Jaccard's index of similarity (J) can yield improvements over Pearson's correlation coefficient. However, the standard null model for Jaccard's index is flawed, and thus introduces its own set of spurious conclusions. We thus identify a better null model based on a hypergeometric distribution, which appropriately corrects for species prevalence. This model is available from recent statistics literature, and can be used for evaluating the significance of any value of an empirically observed Jaccard's index. The resulting simple, yet effective method for handling correlation analysis of microbial presence-absence datasets provides a robust means of testing and finding relationships and/or shared environmental responses among microbial taxa.
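To make the idea of a prevalence-corrected null concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes the common fixed-prevalence formulation in which, under independence, the number of co-occurrences between two species follows a hypergeometric distribution, and it relies on J increasing monotonically with the co-occurrence count once both prevalences are fixed. The function names and the example numbers are hypothetical.

```python
from math import comb

def jaccard(a, n1, n2):
    """Jaccard's index from co-occurrence count a and site counts n1, n2."""
    union = n1 + n2 - a
    return a / union if union else float("nan")

def hypergeom_pmf(a, n_sites, n1, n2):
    """P(exactly a co-occurrences) when species 2 occupies n2 of n_sites
    at random and species 1 occupies n1 of them (assumed null model)."""
    return comb(n1, a) * comb(n_sites - n1, n2 - a) / comb(n_sites, n2)

def jaccard_tail_probs(a_obs, n_sites, n1, n2):
    """Lower and upper tail probabilities of the observed J under the null.

    For fixed n1 and n2, J = a / (n1 + n2 - a) rises with a, so tails of J
    are the same as tails of the co-occurrence count a.
    """
    lo = max(0, n1 + n2 - n_sites)   # fewest co-occurrences possible
    hi = min(n1, n2)                 # most co-occurrences possible
    pmf = {a: hypergeom_pmf(a, n_sites, n1, n2) for a in range(lo, hi + 1)}
    p_lower = sum(p for a, p in pmf.items() if a <= a_obs)
    p_upper = sum(p for a, p in pmf.items() if a >= a_obs)
    return p_lower, p_upper

# Hypothetical pair: 100 sites, each species at 50 sites, 25 shared sites.
p_lower, p_upper = jaccard_tail_probs(25, 100, 50, 50)
```

In this hypothetical case the observed count equals its expectation under independence, so neither tail is small and the pair would not be flagged in either direction.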
Year: 2017 PMID: 29145425 PMCID: PMC5689832 DOI: 10.1371/journal.pone.0187132
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Comparison of co-absent site percentages from different macroecological studies and our current microbiome study.
| Taxon | Number of Taxa | Number of Sites | Co-absent Percent (Median) | Reference |
|---|---|---|---|---|
| small mammals | 11 | 14 | 50% | [ |
| birds | 93 | 42 | 52% | [ |
| lizards | 5 | 42 | 55% | [ |
| seed plants | 1815 | 26 | 58% | [ |
| butterflies | 335 | 81 | 64% | [ |
| fish | 452 | 13 | 69% | [ |
| amphibians | 104 | 11 | 73% | [ |
Fig 1. If 1 and 0 represent the presence and absence of a species, two species yield four possible combinations of states: co-presence (a in figure), mutual exclusion (b and c), and co-absence (d).
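The four counts in Fig 1 are all that is needed to compute both metrics. The sketch below uses the standard textbook formulas for Jaccard's index and the phi coefficient. Which of b and c refers to which species is an assumption here, and the example vectors are made up for illustration.

```python
from math import sqrt

def contingency(x, y):
    """Count the four states for two presence-absence vectors (1/0).

    a: both present, d: both absent.
    b: only species x present, c: only species y present (assumed labeling).
    """
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)
    b = sum(1 for i, j in zip(x, y) if i == 1 and j == 0)
    c = sum(1 for i, j in zip(x, y) if i == 0 and j == 1)
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)
    return a, b, c, d

def jaccard_index(a, b, c, d):
    """J = a / (a + b + c); co-absences (d) are ignored."""
    denom = a + b + c
    return a / denom if denom else float("nan")

def phi_coefficient(a, b, c, d):
    """Pearson's r for binary data; co-absences (d) enter the formula."""
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else float("nan")

# Two rare species observed at 10 sites, then at 100 sites where the
# extra 90 sites contain neither species.
x10 = [1, 1, 0] + [0] * 7
y10 = [1, 0, 1] + [0] * 7
x100 = x10 + [0] * 90
y100 = y10 + [0] * 90

j_small = jaccard_index(*contingency(x10, y10))
j_large = jaccard_index(*contingency(x100, y100))
r_small = phi_coefficient(*contingency(x10, y10))
r_large = phi_coefficient(*contingency(x100, y100))
```

Adding co-absent sites leaves J unchanged but raises r, which is one way to see why datasets dominated by rare taxa can inflate positive correlations under r.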
Fig 2. Two examples of spatially uncorrelated species pairs that the standard null model of Jaccard’s index [6] incorrectly identifies as exhibiting negative (a) and positive (b) correlation. Probability theory indicates that two events are independent if their joint probability is the product of their marginal probabilities (as the chi-square statistic also indicates). In agreement with probability theory, Veech’s null model for co-occurrence analysis [45,56] and our simulated, prevalence-specific null distribution place the observed J right at the center of the null distribution. However, the standard null model assigns an extremely low probability to the observed J, making it invalid for statistical inference on J.
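A simulated, prevalence-specific null distribution of the kind mentioned in this caption can be sketched as follows. This is an illustrative assumption about how such a simulation might be set up (random placement of one species with both prevalences held fixed), not the authors' procedure. The site counts, number of simulations, and seed are arbitrary.

```python
import random

def simulate_null_j(n_sites, n1, n2, n_sims=2000, seed=0):
    """Simulate J under independence with both prevalences held fixed.

    Species 1 is pinned to the first n1 sites; species 2 is placed at
    n2 sites chosen uniformly at random in each simulation.
    """
    rng = random.Random(seed)
    species1 = set(range(n1))
    null_js = []
    for _ in range(n_sims):
        species2 = rng.sample(range(n_sites), n2)
        a = sum(1 for site in species2 if site in species1)
        union = n1 + n2 - a
        null_js.append(a / union if union else float("nan"))
    return null_js

def percentile_of(value, null_values):
    """Mid-rank percentile of value within the simulated null."""
    below = sum(1 for v in null_values if v < value)
    equal = sum(1 for v in null_values if v == value)
    return (below + 0.5 * equal) / len(null_values)

# Hypothetical uncorrelated pair: 100 sites, prevalences 50 and 20, and
# 10 shared sites (joint probability 0.10 = 0.5 * 0.2).
n_sites, n1, n2, a_obs = 100, 50, 20, 10
j_obs = a_obs / (n1 + n2 - a_obs)
null_js = simulate_null_j(n_sites, n1, n2)
pct = percentile_of(j_obs, null_js)
```

For a pair constructed to be independent, the observed J should fall near the middle of this simulated distribution, which is the behavior the caption attributes to a valid null model.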
Fig 3. A comparison of Pearson’s correlation coefficient (r, also called the phi coefficient) and Jaccard’s index of similarity (J) for 844,350 species-pairs.
(a) The similarity indices of all species-pairs, plotted in a J-by-r plot (each pair represented by a circle), were evaluated against a familywise error rate of 5% (alpha for each hypothesis test = 0.05/844350). Quadrant boundaries (red horizontal and green vertical lines) correspond to statistical independence for the two metrics and separate the bivariate plot into four quadrants that differ in correlation directionality. Species-pairs significant for J vs r are distinguished with different colors (“sig.” = significant; “n.s.” = not significant). All the sig. r but n.s. J pairs (gold) are hidden behind sig. r and sig. J pairs (orange). With a stringent alpha of 0.05/844350, a hard-to-notice difference in the percentile of J can determine whether it is significant. (b) For both J and r, all significant pairs are positive. J predicts 66.4% of all species-pairs to be significantly positive whereas r predicts only 48% significant positive. (c) Significant correlations for r and J in panel (a) are similar. The shaded regions, and the corresponding proportions, characterize the distribution of species pairs across quadrants. (d) Venn diagram illustrating that J and r selected many different species-pairs as significant, with only 56.8% of all the species pairs significant for r or J being significant for both metrics. 14% of the species pairs significant for r were not significant for J and 37.4% of the species pairs significant for J were not significant for r.
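The per-test threshold in panel (a) and the overlap figure in panel (d) both follow from simple arithmetic, sketched below. The helper names are hypothetical. The observation that 844,350 equals the number of unordered pairs among 1,300 items is an arithmetic fact. That the dataset contains 1,300 taxa is only an inference, since the taxon count is not stated in this excerpt.

```python
from math import comb

N_PAIRS = 844_350
FWER = 0.05

# Bonferroni-style per-test alpha, as given in the caption (0.05/844350).
alpha_per_test = FWER / N_PAIRS

def is_significant(p_value, alpha=alpha_per_test):
    """Flag a species-pair at the familywise-corrected threshold."""
    return p_value < alpha

# 844,350 is the number of unordered pairs among 1,300 items.
pairs_among_1300 = comb(1300, 2)

def shared_fraction_of_union(frac_r_not_j, frac_j_not_r):
    """Fraction of (sig. r OR sig. J) pairs that are significant for both.

    Scale the shared set to size 1; then the r set has size
    1 / (1 - frac_r_not_j) and the J set has size 1 / (1 - frac_j_not_r).
    """
    size_r = 1.0 / (1.0 - frac_r_not_j)
    size_j = 1.0 / (1.0 - frac_j_not_r)
    return 1.0 / (size_r + size_j - 1.0)

overlap = shared_fraction_of_union(0.14, 0.374)
```

The computed overlap agrees with the 56.8% reported in panel (d) to within rounding, so the three Venn-diagram percentages are mutually consistent.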
Fig 4. Number of species pairs identified as significant by J and r as a function of species prevalence.
The prevalences of the two species in a given pair are shown on the two axes of the grids. After binning the prevalence at 5% intervals, the total number of significant pairs in each grid cell was counted. Color scales do not match across plots; gray cells indicate a lack of species pairs. Both J (a) and r (b) detect many significantly correlated species pairs (all positive) when at least one of the species in the pair is rare. However, when one of the species is abundant, r, unlike J, fails to detect significant pairs (b). The difference in the number of species pairs significant for J and r shows a strong pattern with species prevalence (c). The total number of species pairs in the species prevalence grid is shown in (d). Of 844350 species pairs, 627539 (74.3%) have at least one very rare species (<10% prevalence), whereas 205120 (24.3%) have both species very rare.
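The binning step described in this caption can be sketched as below. The caption does not say how bin edges were handled, so this sketch assumes left-closed 5% bins with 100% prevalence placed in the last bin, and it works from integer site counts to avoid floating-point trouble at bin boundaries. The pair list is hypothetical.

```python
from collections import Counter

N_BINS = 20  # 5% prevalence bins

def prevalence_bin(n_present, n_sites, n_bins=N_BINS):
    """Bin index from integer counts (avoids float rounding at bin edges)."""
    return min(n_present * n_bins // n_sites, n_bins - 1)

def is_rare(n_present, n_sites):
    """Rare means prevalence strictly below 10%."""
    return n_present * 10 < n_sites

def grid_counts(pairs, n_sites):
    """Count pairs per (bin of species 1, bin of species 2) grid cell."""
    grid = Counter()
    for n1, n2 in pairs:
        cell = (prevalence_bin(n1, n_sites), prevalence_bin(n2, n_sites))
        grid[cell] += 1
    return grid

# Hypothetical significant pairs, given as (sites occupied by species 1,
# sites occupied by species 2) out of 100 sites.
pairs = [(3, 4), (3, 60), (12, 15), (100, 100)]
grid = grid_counts(pairs, n_sites=100)
n_with_rare = sum(1 for n1, n2 in pairs if is_rare(n1, 100) or is_rare(n2, 100))
```

Running the same counting over all pairs, rather than only the significant ones, would give the denominator grid corresponding to panel (d).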
Examples of studies that used presence-absence data to compute Jaccard’s similarity index (J) for determining similarity between systems (e.g., between taxa-pairs, between sites, between markets) where the statistical significance of J is faulty and the use of observed value of J as a similarity metric is flawed.
| Study | Probability of | Raw scores of |
|---|---|---|
| [ | Not done | Sites compared based on species composition |
| [ | Not done | Land use types compared based on species composition |
| [ | Color of beach washed plastic and the one in seabird’s gut was compared to assess plastic pollution | |
| [ | Done with [ | Two methods for determining diet of white-tailed deer were compared based on plant species |
| [ | Similarity in local environment plastic pollution and ingested plastic in seabirds estimated | |
| [ | Not done | Site similarities estimated based on |
| [ | Information not available | Distributional similarity of species determined by their site-occupancy |
| [ | Not done | Identity of predators was used to calculate food web similarity for many species-pairs and this similarity was used to estimate phylogenetic signal in the community |
| [ | Not done | Sites were hierarchically clustered based on |
| [ | Not done | Sites were hierarchically clustered based on |
| [ | Done with [ | Bushmeat markets in Africa were compared for their similarity ( |
| [ | Not done | |
| [ | Done with [ | |
| [ | Not done | Various types of forest were compared for their similarities ( |
| [ | Not done | Similarity of two sites ( |
| [ | Not done | Two primate species are compared based on seed of plant species dispersed by the primates |
| [ | Not done | Alpine sites were hierarchically clustered based on similarity ( |
| [ | Done with [ | Distributional similarity between species ( |
| [ | Not done | Species-pair similarity ( |
| [ | Not done | Site similarities estimated based on |
| [ | Information not available | Distributional data was used to determine species-pair similarity ( |
| [ | Not done | Similarity between habitat types ( |
| [ | Not done | Similarity between sites ( |
| [ | Not done | Species-pairs compared for their similarity ( |
| [ | Not done | Similarity between sites ( |
| [ | Done with [ | Similarity between species ( |
| [ | Done with [ | Similarity between sites ( |
| [ | Not done | Similarity between habitat types ( |
| [ | Feed type of horses and germination of invasive species from seeds collected from fecal samples were correlated with | |
| [ | Not done | Site-pairs were compared for their similarity based on composition of bat species |
| [ | Done with [ | Similarity between site-pairs ( |
| [ | Not done | Similarity between site-pairs ( |
| [ | Done with [ | Similarity between geographic units based on species composition was explained by covariates |
| [ | Information not available | Monthly samples of crustacean community were compared and the months were hierarchically clustered based on the similarity ( |
| [ | Done with [ | Identify biogeographic divisions based on species composition similarity of various regions and the hierarchical clustering of the regions |
| [ | Information not available | Fungal communities associated to roots of |
| [ | Not done | Various clinical and environmental isolates of |
| [ | Done with [ | Two strains of Streptococcus pneumoniae were studied for daptomycin-sensitivity; responding genetic network was compared between the strains with |
| [ | Not done | Bacterial communities from two sites were compared with |
| [ | Not done | Similarity ( |
| [ | Not done | Similarity in amplification pattern of various isolates and dendrogram of hierarchical clustering |
Whereas Google Scholar returns over 100,000 publications that include “Jaccard’s” or “Jaccard”, this table includes all the studies that cite Real and Vargas’s paper about the standard null model [6]. Of the 41 studies listed in this table, 24 did not determine the statistical significance of J, 4 lacked enough information to indicate whether they determined the statistical significance, 3 used an arbitrary J cutoff to declare significance, and 10 determined the probability but with one of three faulty null models: [6,69,70]. We demonstrate in Fig 2 why the most widely used null model [6] is faulty and discuss why it is faulty in the “Results” and “Discussion” sections. Two other null models for J, i.e. [69,70], are equally faulty because they suffer from the same problems as [6]. Irrespective of the statistical significance, comparing two observed values of J (as was done in every study listed in this table) is incorrect because a given value of J could mean anything from strong positive to strong negative correlation, depending on the species-pair specific null model (see “Results”).
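A short worked calculation illustrates why the same observed J can indicate opposite relationships. It assumes $N$ sites and prevalences $p_1 = n_1/N$ and $p_2 = n_2/N$, and it evaluates J at the co-occurrence count expected under independence. This is an approximation to the center of the null distribution, not its exact mean, and the numerical prevalences are hypothetical.

```latex
% Under independence the expected number of co-occurrences is
%   a_0 = N p_1 p_2,
% so the value of J at independence is
\[
J_0 \;=\; \frac{a_0}{n_1 + n_2 - a_0}
     \;=\; \frac{N p_1 p_2}{N p_1 + N p_2 - N p_1 p_2}
     \;=\; \frac{p_1 p_2}{p_1 + p_2 - p_1 p_2}.
\]
% Two rare species, p_1 = p_2 = 0.05:
\[
J_0 = \frac{0.0025}{0.0975} \approx 0.026 .
\]
% Two common species, p_1 = p_2 = 0.8:
\[
J_0 = \frac{0.64}{0.96} \approx 0.667 .
\]
```

An observed J of 0.3 therefore lies far above the independence value for the rare pair and far below it for the common pair, so raw J values from pairs with different prevalences are not directly comparable.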