FastSpar: rapid and scalable correlation estimation for compositional data.

Stephen C Watts¹, Scott C Ritchie^2,3,4, Michael Inouye^2,3,4, Kathryn E Holt¹.

Abstract

SUMMARY: A common goal of microbiome studies is the elucidation of community composition and member interactions using counts of taxonomic units extracted from sequence data. Inference of interaction networks from sparse and compositional data requires specialized statistical approaches. A popular solution is SparCC; however, its performance limits the calculation of interaction networks for very high-dimensional datasets. Here we introduce FastSpar, an efficient and parallelizable implementation of the SparCC algorithm which rapidly infers correlation networks and calculates P-values using an unbiased estimator. We further demonstrate that FastSpar reduces network inference wall time by 2-3 orders of magnitude compared to SparCC.
AVAILABILITY AND IMPLEMENTATION: FastSpar source code, precompiled binaries and platform packages are freely available on GitHub: github.com/scwatts/FastSpar.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30169561 PMCID: PMC6419895 DOI: 10.1093/bioinformatics/bty734

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Microbiome analysis, which aims to assay the bacterial communities present in a given sample set, is important in many fields, from human health to agriculture and environmental ecology. The current standard for investigating bacterial community composition is to deep sequence the total genomic DNA or the bacterial 16S rRNA gene and analyze the genetic diversity and abundance within each sample. Unique sequences or sequence clusters are taken to represent operational taxonomic units (OTUs) present in the original sample, and the frequencies of these across samples are summarized in the form of an OTU table (Ju and Zhang, 2015). In many studies, these data are then used to construct correlation networks of OTUs spanning sample sets, which can be used to infer or approximate interactions between taxa (He et al., 2017; Nakatsu et al., 2015). The calculation of OTU correlation values is complicated by the sparse and compositional nature of the data. OTU counts are typically normalized by dividing each observation within a sample by the total count for that sample, giving a measure of relative abundance. However, this transformation introduces dependencies between normalized sample observations, such that calculating simple correlations from the resulting values is not statistically valid (Aitchison, 1982). To perform robust and unbiased statistical analysis of sparse compositional data, the data are generally first transformed from the simplex to Euclidean real space, which can be achieved by taking the log ratio of OTU fractions. Using log ratios restores independence for each OTU and allows components to take on a positive or negative value. Building upon the use of log ratios, Friedman and Alm (2012) describe a robust algorithm, SparCC, to estimate the linear Pearson correlation between OTUs from variances of log ratios.
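The normalization and log-ratio steps described above can be illustrated with a minimal sketch. It assumes strictly positive counts (real OTU tables contain zeros, which need pseudocounts or a similar treatment first), and the function names `to_fractions` and `log_ratio_variances` are hypothetical, not part of SparCC or FastSpar.

```python
import math
from statistics import variance


def to_fractions(counts):
    """Normalize one sample: divide each OTU count by the sample total."""
    total = sum(counts)
    return [c / total for c in counts]


def log_ratio_variances(table):
    """Return T where T[i][j] is the variance, across samples, of log(x_i / x_j).

    `table` is a list of samples; each sample is a list of strictly positive
    OTU values (counts or fractions).
    """
    n_otus = len(table[0])
    t = [[0.0] * n_otus for _ in range(n_otus)]
    for i in range(n_otus):
        for j in range(i + 1, n_otus):
            log_ratios = [math.log(sample[i] / sample[j]) for sample in table]
            t[i][j] = t[j][i] = variance(log_ratios)
    return t


# Toy OTU table (illustrative values only): 3 samples x 3 OTUs.
counts = [[10, 20, 70], [30, 30, 40], [5, 15, 80]]
fractions = [to_fractions(sample) for sample in counts]
T = log_ratio_variances(fractions)
```

Because the per-sample total cancels in every ratio, the matrix computed from fractions matches the one computed from raw counts, which is why log ratios sidestep the dependency introduced by normalization.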
Given that correlations cannot be calculated directly from log ratio variances, SparCC estimates the correlation statistic by using log ratio variances to approximate the true OTU variance, on the assumption that the number of strong correlates is small (Friedman and Alm, 2012). A Python 2 implementation of the SparCC algorithm has been released by the authors with several ancillary scripts for P-value estimation. However, the performance of this implementation precludes analysis of large datasets such as those generated from longitudinal studies (Teo et al.). Further, the P-value estimator used by SparCC has been demonstrated to be biased and to overestimate significance (Phipson and Smyth, 2010). Here we present FastSpar, a fast and parallelizable implementation of the SparCC algorithm with an unbiased P-value estimator. We demonstrate that FastSpar produces OTU correlations equivalent to those of SparCC while greatly reducing run time and memory consumption on large datasets. We also show that FastSpar has superior performance to the unpublished re-implementations of SparCC available in the mothur and SpiecEasi packages (Supplementary Fig. S1).
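The approximation mentioned at the start of this paragraph can be sketched as follows. For log ratio variances T[i][j] = w_i + w_j - 2 * rho_ij * sqrt(w_i * w_j), where w_i is the variance of the log abundance of OTU i, summing each row and neglecting the correlation terms gives row_sum_i ≈ (D - 2) * w_i + sum(w), which can be solved in closed form for D OTUs. This is only the basic single-pass step as we understand it from Friedman and Alm (2012); the full algorithm also iteratively excludes strongly correlated pairs and handles non-positive variance estimates, both omitted here. Function names are hypothetical.

```python
import math


def basis_variances(t):
    """Estimate per-OTU log-abundance variances from the log-ratio variance matrix.

    Solves row_sum_i = (D - 2) * w_i + sum(w) for all i, which follows from
    summing T[i][j] = w_i + w_j over j while neglecting correlation terms.
    """
    d = len(t)
    if d < 3:
        raise ValueError("at least three OTUs are required")
    row_sums = [sum(row) for row in t]
    total_variance = sum(row_sums) / (2 * (d - 1))  # estimate of sum(w)
    return [(r - total_variance) / (d - 2) for r in row_sums]


def approximate_correlations(t):
    """Back out rho_ij from T[i][j] = w_i + w_j - 2 * rho_ij * sqrt(w_i * w_j)."""
    w = basis_variances(t)
    if min(w) <= 0:
        # Real implementations handle this case; the sketch simply refuses.
        raise ValueError("non-positive variance estimate")
    d = len(t)
    rho = [[1.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            value = (w[i] + w[j] - t[i][j]) / (2 * math.sqrt(w[i] * w[j]))
            rho[i][j] = rho[j][i] = value
    return rho


# Sanity check on synthetic input: uncorrelated OTUs with known variances.
true_w = [1.0, 2.0, 0.5, 1.5]
T = [[0.0 if i == j else true_w[i] + true_w[j] for j in range(4)] for i in range(4)]
estimated_w = basis_variances(T)
rho = approximate_correlations(T)
```

When the input really is uncorrelated, as in this synthetic matrix, the variances are recovered exactly and all off-diagonal correlations are zero; with real data the neglected terms make the result an approximation.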

2 Implementation

FastSpar is written in C++11, utilizing OpenBLAS and LAPACK via the Armadillo library (Sanderson and Curtin; Dongarra et al.; Xianyi et al.). The GNU Scientific Library (GSL) provides functionality for OTU fraction estimation and threading support is delivered by OpenMP (Dagum and Menon, 1998). In place of the P-value estimator used in SparCC, we utilized an estimator which corrects P-value understatement by considering the possibility of recalling repetitious permutations or original data during testing (Phipson and Smyth, 2010).
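The difference between a naive permutation P-value and a corrected one can be shown with a small sketch. It uses the simple (b + 1) / (m + 1) form discussed by Phipson and Smyth (2010), where b is the number of permuted statistics at least as extreme as the observed one and m is the number of permutations. Whether FastSpar uses this form or the exact variant from the same paper is not stated in this text, so treat the code as an illustration of the idea rather than FastSpar's implementation. Function names are hypothetical.

```python
def count_extreme(observed, permuted):
    """Count permuted correlations at least as extreme (two-sided) as the observed one."""
    return sum(1 for value in permuted if abs(value) >= abs(observed))


def naive_p_value(n_extreme, n_permutations):
    """Biased estimator: can return exactly zero, overstating significance."""
    return n_extreme / n_permutations


def corrected_p_value(n_extreme, n_permutations):
    """Counts the observed data as one more arrangement, so the result is never zero."""
    return (n_extreme + 1) / (n_permutations + 1)


# With 999 permutations and no permuted value as extreme as the observed one:
p_naive = naive_p_value(0, 999)          # 0.0
p_corrected = corrected_p_value(0, 999)  # 0.001
```

The corrected estimate is bounded below by 1 / (m + 1), so the number of permutations sets the smallest P-value that can be reported.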

3 Results

3.1 Algorithm fidelity

To demonstrate that FastSpar produces correlations equivalent to those of SparCC, correlation networks were constructed by both programs using random subsets of an OTU table generated from the American Gut Project 16S rRNA sequence data (www.americangut.org), comprising a total of 6068 OTUs and 7523 samples. For each OTU pair, the mean correlation values calculated across 20 replicate runs were nearly identical for FastSpar and SparCC (Supplementary Figs S2 and S3). The observed OTU correlations calculated by SparCC and FastSpar are not reproduced exactly because there is a degree of randomness in the underlying algorithm. Specifically, OTU fractions are estimated by drawing from a Dirichlet probability distribution (parameterized using sample OTU counts with pseudocounts applied) and are therefore non-deterministic. Hence replicate runs of either program on the same input table produce similar but non-identical results (Supplementary Fig. S2A and B). To allow direct comparison of the algorithms, OTU fractions were pre-computed and provided as an additional input to both SparCC and FastSpar [note that the behaviour of the pseudo-random number generators (PRNGs) used by FastSpar (GSL) and SparCC (numpy) differs, so seeding the PRNGs is insufficient to enable direct comparison]. When using the same pre-computed OTU fractions as input, FastSpar and SparCC returned identical results (Supplementary Fig. S2D). These comparisons can be reproduced by running the code at github.com/scwatts/fastspar_comparison.
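The source of run-to-run variation described above can be sketched with the Python standard library. A Dirichlet draw is obtained by normalizing independent gamma variates whose shape parameters are the counts plus a pseudocount. The pseudocount of 1.0 and the function name `draw_fractions` are assumptions made for illustration; this is neither the GSL code path used by FastSpar nor the numpy one used by SparCC.

```python
import random


def draw_fractions(counts, pseudocount=1.0, rng=random):
    """Draw OTU fractions for one sample from Dirichlet(counts + pseudocount)."""
    # Normalizing independent Gamma(alpha_i, 1) variates yields a Dirichlet sample.
    gammas = [rng.gammavariate(c + pseudocount, 1.0) for c in counts]
    total = sum(gammas)
    return [g / total for g in gammas]


sample_counts = [0, 12, 3, 85]  # toy counts for one sample (illustrative only)

# Two successive draws from one generator differ, as replicate runs do.
rng = random.Random(42)
draw_a = draw_fractions(sample_counts, rng=rng)
draw_b = draw_fractions(sample_counts, rng=rng)

# Re-seeding the same generator reproduces a draw exactly.
draw_c = draw_fractions(sample_counts, rng=random.Random(42))
```

Seeding makes a single generator reproducible, but two different generator implementations given the same seed still produce different streams, which is why pre-computed fractions were needed for the direct comparison.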

3.2 Performance profiling

Performance was compared by running FastSpar and SparCC on random subsets of the American Gut Project OTU table (Fig. 1). Ten random subsets were generated for each combination of sample number (n = 250, 500, …, 2500) and OTU number (n = 250, 500, …, 2500) and analyzed using FastSpar (with and without threading) and SparCC. Wall time and memory usage were recorded using GNU time. The analysis was completed in an Ubuntu 17.04 (Zesty Zapus) chroot environment with the required software packages (Supplementary Table S1). Computation was performed with an Intel(R) Xeon(R) E5-2630 CPU @ 2.30GHz and 62 GB RAM. The performance profiling can be reproduced by running the code at github.com/scwatts/fastspar_timed.
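The subsetting design can be sketched as below. The sketch assumes the ellipses denote steps of 250, giving 10 sample sizes and 10 OTU sizes, and therefore 10 x 10 x 10 = 1000 subsets in total. The function name `subset_plan` and the seed are hypothetical; the authors' actual scripts are in the repository named above.

```python
import itertools
import random

SIZES = list(range(250, 2501, 250))  # 250, 500, ..., 2500 (assumed step of 250)
REPLICATES = 10


def subset_plan(n_samples_total, n_otus_total, seed=0):
    """Yield row and column indices for every random subset of the OTU table."""
    rng = random.Random(seed)
    for n_samples, n_otus in itertools.product(SIZES, SIZES):
        for replicate in range(REPLICATES):
            yield {
                "n_samples": n_samples,
                "n_otus": n_otus,
                "replicate": replicate,
                # Sampling without replacement from the full table dimensions.
                "samples": rng.sample(range(n_samples_total), n_samples),
                "otus": rng.sample(range(n_otus_total), n_otus),
            }


# Dimensions of the full American Gut Project table as reported in Section 3.1.
plan = list(subset_plan(n_samples_total=7523, n_otus_total=6068))
```

Each entry in `plan` identifies one subset to extract and time; the timing itself (for example with GNU time) is outside the scope of this sketch.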

Fig. 1.

Performance profile of FastSpar and SparCC across random subsets of different sizes, extracted from the American Gut Project OTU table. (A) Wall time and (B) memory profiles were recorded using GNU time. (C) Linear models describing FastSpar (single thread) performance metrics with relation to input data dimensions.

Using 16 threads, FastSpar was up to 821× faster than SparCC (mean 221× faster; Fig. 1A). Using a single thread, FastSpar was up to 118× faster than SparCC (mean 32× faster; Fig. 1A). The memory usage of FastSpar was up to 116× less than SparCC (mean 26× less; Fig. 1B). Notably, the memory performance of SparCC improves considerably on datasets with more than 1000 OTUs, owing to the conditional use of a more memory-efficient calculation for the variation matrix (Fig. 1B). This conditional calculation appears to be beneficial for SparCC when analyzing datasets with 500 or fewer OTUs but causes a substantial performance degradation for datasets with 500–1000 OTUs (Supplementary Fig. S4). As expected, both run time and memory principally scale with OTU number rather than sample number (Fig. 1C). For large datasets, it is therefore essential to pre-process the OTU table in order to reduce the number of OTUs prior to calculating correlations. This can be achieved primarily using two approaches: (i) filtering poorly represented OTUs, or (ii) distribution-based clustering such as that used in dbOTU3. The latter approach aims to reunite OTUs derived from sequencing error with the parent OTU by clustering OTUs based on nucleotide edit distance and count distribution (Preheim et al., 2013). This has the advantage of retaining count information and thus improving statistical power.
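Approach (i), filtering poorly represented OTUs, can be sketched as follows. The thresholds (present in at least 10% of samples, at least 10 reads in total) and the function name `filter_otus` are illustrative assumptions, not values or code from the paper; suitable cut-offs depend on the dataset.

```python
def filter_otus(table, min_prevalence=0.1, min_total=10):
    """Keep OTUs seen in enough samples and with enough reads overall.

    `table` maps an OTU identifier to its list of counts, one per sample.
    """
    kept = {}
    for otu, counts in table.items():
        prevalence = sum(1 for c in counts if c > 0) / len(counts)
        if prevalence >= min_prevalence and sum(counts) >= min_total:
            kept[otu] = counts
    return kept


# Toy table (illustrative values only): 3 OTUs x 4 samples.
otu_table = {
    "otu_1": [12, 0, 7, 30],  # common and abundant: kept
    "otu_2": [0, 0, 1, 0],    # too few reads in total: dropped
    "otu_3": [0, 0, 0, 40],   # abundant, but in only one of four samples
}
filtered = filter_otus(otu_table, min_prevalence=0.5, min_total=10)
```

Unlike distribution-based clustering, this filter discards the counts of the removed OTUs rather than merging them into a parent OTU, which is the trade-off noted above.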
To simplify the workflow for large-scale correlation network analyses of microbiome data, FastSpar is packaged with an efficient C++11 implementation of dbOTU3 (github.com/scwatts/otudistclust) that has been optimized for analysis of large datasets by applying concurrency design patterns. FastSpar provides a more robust and efficient method for inferring correlation networks from large microbiome datasets, an analysis that was previously intractable yet is likely to become commonplace in modern cohort studies.

Funding

This work was supported by the National Health and Medical Research Council of Australia (Project #1062227, Fellowship #1061409 to K.E.H., Fellowship #1061435 to M.I. co-funded by the Australian Heart Foundation) and by the Australian Government Research Training Program (Scholarship to S.W. and S.R.).

Conflict of Interest: none declared.

6 in total

1. Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.

Authors: Belinda Phipson; Gordon K Smyth
Journal: Stat Appl Genet Mol Biol Date: 2010-10-31

2. Distribution-based clustering: using ecology to refine the operational taxonomic unit.

Authors: Sarah P Preheim; Allison R Perrotta; Antonio M Martin-Platero; Anika Gupta; Eric J Alm
Journal: Appl Environ Microbiol Date: 2013-08-23 Impact factor: 4.792

3. 16S rRNA gene high-throughput sequencing data mining of microbial diversity and interactions.

Authors: Feng Ju; Tong Zhang
Journal: Appl Microbiol Biotechnol Date: 2015-03-27 Impact factor: 4.813

4. Inferring correlation networks from genomic survey data.

Authors: Jonathan Friedman; Eric J Alm
Journal: PLoS Comput Biol Date: 2012-09-20 Impact factor: 4.475

5. Gut mucosal microbiome across stages of colorectal carcinogenesis.

Authors: Geicho Nakatsu; Xiangchun Li; Haokui Zhou; Jianqiu Sheng; Sunny Hei Wong; William Ka Kai Wu; Siew Chien Ng; Ho Tsoi; Yujuan Dong; Ning Zhang; Yuqi He; Qian Kang; Lei Cao; Kunning Wang; Jingwan Zhang; Qiaoyi Liang; Jun Yu; Joseph J Y Sung
Journal: Nat Commun Date: 2015-10-30 Impact factor: 14.919

6. Two distinct metacommunities characterize the gut microbiota in Crohn's disease patients.

Authors: Qing He; Yuan Gao; Zhuye Jie; Xinlei Yu; Janne Marie Laursen; Liang Xiao; Ying Li; Lingling Li; Faming Zhang; Qiang Feng; Xiaoping Li; Jinghong Yu; Chuan Liu; Ping Lan; Ting Yan; Xin Liu; Xun Xu; Huanming Yang; Jian Wang; Lise Madsen; Susanne Brix; Jianping Wang; Karsten Kristiansen; Huijue Jia
Journal: Gigascience Date: 2017-07-01 Impact factor: 6.524

45 in total

1. Composition of nasal bacterial community and its seasonal variation in health care workers stationed in a clinical research laboratory.

Authors: Nazima Habibi; Abu Salim Mustafa; Mohd Wasif Khan
Journal: PLoS One Date: 2021-11-24 Impact factor: 3.240

2. Interaction between endometrial microbiota and host gene regulation in recurrent implantation failure.

Authors: Peigen Chen; Lei Jia; Yi Zhou; Yingchun Guo; Cong Fang; Tingting Li
Journal: J Assist Reprod Genet Date: 2022-07-26 Impact factor: 3.357

3. Heat stress impacts the multi-domain ruminal microbiota and some of the functional features independent of its effect on feed intake in lactating dairy cows.

Authors: Tansol Park; Lu Ma; Shengtao Gao; Dengpan Bu; Zhongtang Yu
Journal: J Anim Sci Biotechnol Date: 2022-06-15

4. Bacterial Communities in Concrete Reflect Its Composite Nature and Change with Weathering.

Authors: E Anders Kiledal; Jessica L Keffer; Julia A Maresca
Journal: mSystems Date: 2021-05-04 Impact factor: 6.496

5. Bayesian biclustering for microbial metagenomic sequencing data via multinomial matrix factorization.

Authors: Fangting Zhou; Kejun He; Qiwei Li; Robert S Chapkin; Yang Ni
Journal: Biostatistics Date: 2022-07-18 Impact factor: 5.279

6. Airway Microbiota Dynamics Uncover a Critical Window for Interplay of Pathogenic Bacteria and Allergy in Childhood Respiratory Disease.

Authors: Shu Mei Teo; Howard H F Tang; Danny Mok; Louise M Judd; Stephen C Watts; Kym Pham; Barbara J Holt; Merci Kusel; Michael Serralha; Niamh Troy; Yury A Bochkov; Kristine Grindle; Robert F Lemanske; Sebastian L Johnston; James E Gern; Peter D Sly; Patrick G Holt; Kathryn E Holt; Michael Inouye
Journal: Cell Host Microbe Date: 2018-09-12 Impact factor: 21.023

7. The structure of microbial populations in Nelore GIT reveals inter-dependency of methanogens in feces and rumen.

Authors: Bruno G N Andrade; Flavia A Bressani; Rafael R C Cuadrat; Polyana C Tizioto; Priscila S N de Oliveira; Gerson B Mourão; Luiz L Coutinho; James M Reecy; James E Koltes; Paul Walsh; Alexandre Berndt; Julio C P Palhares; Luciana C A Regitano
Journal: J Anim Sci Biotechnol Date: 2020-02-24

8. Disentangling the mechanisms shaping the surface ocean microbiota.

Authors: Ramiro Logares; Ina M Deutschmann; Pedro C Junger; Caterina R Giner; Anders K Krabberød; Thomas S B Schmidt; Laura Rubinat-Ripoll; Mireia Mestre; Guillem Salazar; Clara Ruiz-González; Marta Sebastián; Colomban de Vargas; Silvia G Acinas; Carlos M Duarte; Josep M Gasol; Ramon Massana
Journal: Microbiome Date: 2020-04-20 Impact factor: 14.650

9. Microbiotyping the Sinonasal Microbiome.

Authors: Ahmed Bassiouni; Sathish Paramasivan; Arron Shiffer; Matthew R Dillon; Emily K Cope; Clare Cooksley; Mahnaz Ramezanpour; Sophia Moraitis; Mohammad Javed Ali; Benjamin S Bleier; Claudio Callejas; Marjolein E Cornet; Richard G Douglas; Daniel Dutra; Christos Georgalas; Richard J Harvey; Peter H Hwang; Amber U Luong; Rodney J Schlosser; Pongsakorn Tantilipikorn; Marc A Tewfik; Sarah Vreugde; Peter-John Wormald; J Gregory Caporaso; Alkis J Psaltis
Journal: Front Cell Infect Microbiol Date: 2020-04-08 Impact factor: 5.293

10. MetagenoNets: comprehensive inference and meta-insights for microbial correlation networks.

Authors: Sunil Nagpal; Rashmi Singh; Deepak Yadav; Sharmila S Mande
Journal: Nucleic Acids Res Date: 2020-07-02 Impact factor: 16.971